This notebook takes the reader through the IceCube Neutrino Observatory dataset presented by Particle Physics Playground. The dataset is a record of 10 neutrino detection events observed by IceCube, quantifying the optical signal and Cartesian coordinates associated with each event. The detector monitors light emitted as Cherenkov radiation, generated when high-energy neutrinos interact with atoms in the ice. When a neutrino passes through the observatory and induces such an interaction, relativistic elementary particles are released. Within the ice, these particles can move faster than the local speed of light in the medium. When this happens, much as an airplane exceeding the speed of sound produces a sonic boom, the particles create an optical analogue known as Cherenkov radiation. Photosensitive detector modules, housed in borosilicate glass and strung throughout the observatory, record this radiation. The full dataset is compiled from 10 separate detection events, each recording light generated by a different flavour of neutrino. This investigation seeks to classify neutrino flavour from parameters such as the optical signal and the frequency of detected hits during an event. The analysis suggests that such classification is possible, though a more thorough study would be required to conclude it definitively.
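Since the Cherenkov condition is simply that a particle's speed exceed the local phase velocity of light, c/n, the emission threshold is easy to sketch numerically. A minimal illustration, assuming a commonly quoted refractive index of roughly n ≈ 1.31 for deep glacial ice and the 0.511 MeV electron rest energy (both values are assumptions of this sketch, not taken from the dataset):

```python
import math

def cherenkov_threshold(n, rest_energy_mev):
    """Minimum speed (as a fraction of c) and minimum kinetic energy [MeV]
    for a charged particle to emit Cherenkov light in a medium with
    refractive index n."""
    beta_min = 1.0 / n                              # v/c must exceed 1/n
    gamma = 1.0 / math.sqrt(1.0 - beta_min ** 2)    # Lorentz factor at threshold
    ke_min = (gamma - 1.0) * rest_energy_mev        # kinetic energy at threshold
    return beta_min, ke_min

# Assumed values: n ~ 1.31 for deep ice, electron rest energy 0.511 MeV
beta_min, ke_min = cherenkov_threshold(1.31, 0.511)
print(f"beta threshold: {beta_min:.3f}, kinetic energy threshold: {ke_min:.3f} MeV")
```

For an electron this works out to a threshold speed of about 0.76 c, i.e. only a few hundred keV of kinetic energy, which is why even modest-energy secondaries light up the detector.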
This cell loads the necessary libraries, such as pandas, seaborn, and the sklearn modules. These libraries are used extensively throughout this notebook.
#CITE: Particle Physics Playground
import math
import random
import sys
import warnings

import matplotlib
import matplotlib.path as mpath
import matplotlib.pyplot as plt
from matplotlib import MatplotlibDeprecationWarning
from matplotlib.patches import Ellipse
from matplotlib.pyplot import figure, show, axis
import numpy as np
import pandas as pd
import pylab as P
import scipy
import seaborn as sns
from astropy import units as u
from astropy.coordinates import SkyCoord
from astropy.io import ascii
from sklearn.cluster import KMeans, k_means
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression, Lasso, LassoCV
from sklearn.metrics import (classification_report, confusion_matrix,
                             mean_squared_error, r2_score, roc_auc_score,
                             roc_curve, silhouette_score)
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier

warnings.filterwarnings("ignore", category=MatplotlibDeprecationWarning)
matplotlib.style.use('ggplot')
sns.set_style('darkgrid')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
This cell reads in the data; this step is provided by the Particle Physics Playground. In addition, I engineer new columns that convert the "hits per parameter" ratios into units, and I combine everything into one large dataframe so the neutrino hits in all events can be monitored simultaneously.
# IMPORT DATA STEP (This step is provided by the Particle Physics Playground)
import pandas as pd
import numpy as np
import scipy.stats as stats
import csv
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
!pip install -q git+https://github.com/mattbellis/h5hep.git
!pip install -q git+https://github.com/mattbellis/particle_physics_simplified.git
import pps_tools as pps
import h5hep
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
infilename = 'icecube_events_small.h5'
pps.download_from_drive('icecube_events_small.h5')
infilename = 'data/' + infilename
all_data = pps.get_all_data(infilename)
# This fills 'event' with one entry from 'data'
event = pps.get_icecube_event(all_data, 0)
df_event = pd.DataFrame(event)
df_event["q"] = df_event["hits/nhits"]/df_event["hits/q"]
df_event["t"] = df_event["hits/nhits"]/df_event["hits/t"]
df_event["x"] = df_event["hits/nhits"]/df_event["hits/x"]
df_event["y"] = df_event["hits/nhits"]/df_event["hits/y"]
df_event["z"] = df_event["hits/nhits"]/df_event["hits/z"]
df_event.head()
Loading in the data... Building the indices... Built the indices! Data is read in and input file is closed.
| | _SINGLETON_/INDEX | hits/nhits | hits/q | hits/t | hits/x | hits/y | hits/z | q | t | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 844 | 1.025 | 20513.0 | -9.130000 | -481.739990 | 158.820007 | 823.414653 | 0.041145 | -92.442496 | -1.751982 | 5.314192 |
| 1 | 1 | 844 | 0.375 | 20538.0 | -9.130000 | -481.739990 | 158.820007 | 2250.666667 | 0.041095 | -92.442496 | -1.751982 | 5.314192 |
| 2 | 1 | 844 | 0.925 | 8724.0 | 114.389999 | -461.989990 | 498.660004 | 912.432421 | 0.096745 | 7.378267 | -1.826879 | 1.692536 |
| 3 | 1 | 844 | 0.825 | 10320.0 | 361.000000 | -422.829987 | 414.410004 | 1023.030318 | 0.081783 | 2.337950 | -1.996074 | 2.036630 |
| 4 | 1 | 844 | 0.425 | 11294.0 | -211.350006 | -404.480011 | 498.630005 | 1985.882297 | 0.074730 | -3.993376 | -2.086630 | 1.692638 |
This section of the code initializes the df_all_events dataframe, which amalgamates all of the events into one frame. This allows us to compare the neutrinos observed across all events and cross-reference them against each other directly.
#CITE: chatGPT assistance
all_events = []
for i in range(10):
    #Load in event i from the entire event library
    df_event = pd.DataFrame(pps.get_icecube_event(all_data, i))
    #Engineer the non-ratio features (i.e. get rid of hits per unit parameter)
    df_event["q"] = df_event["hits/nhits"] / df_event["hits/q"]
    df_event["t"] = df_event["hits/nhits"] / df_event["hits/t"]
    df_event["x"] = df_event["hits/nhits"] / df_event["hits/x"]
    df_event["y"] = df_event["hits/nhits"] / df_event["hits/y"]
    df_event["z"] = df_event["hits/nhits"] / df_event["hits/z"]
    #Index the event number
    df_event["event"] = i + 1
    #Append to a master list
    all_events.append(df_event)
#Concatenate the master list into a master dataframe that includes the data of all events 1-10
df_all_events = pd.concat(all_events, ignore_index=True)
#Group by event and compute the mean and standard deviation of each event's optical parameter, storing the results in lists
means = df_all_events.groupby("event")["hits/q"].mean().tolist()
stds = df_all_events.groupby("event")["hits/q"].std().tolist()
ind = df_all_events.groupby("event")["q"].mean().tolist()
st = df_all_events.groupby("event")["q"].std().tolist()
index = list(range(1, 11))
The columns on the far right are engineered from the number of hits and the hit ratios. This was done by dividing the event's hit count by the corresponding ratio parameter (e.g. hits/z), which transforms the data into a different unit space so that a more physical parameter such as z or q can be examined. This ended up not being too important, but it served as an interesting lens through which to examine the data.
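Since this engineering block recurs for every event, it could be factored into a small helper. A sketch of that refactoring; the toy frame below is a stand-in built from the first two rows of the table above (real frames come from `pps.get_icecube_event`):

```python
import pandas as pd

def engineer_units(df):
    """Divide the hit count through each per-hit ratio column, producing
    the 'unit-space' columns q, t, x, y, z used throughout the notebook."""
    for col in ["q", "t", "x", "y", "z"]:
        df[col] = df["hits/nhits"] / df[f"hits/{col}"]
    return df

# Toy stand-in for one event, copied from the first two rows shown above
toy = pd.DataFrame({
    "hits/nhits": [844, 844],
    "hits/q": [1.025, 0.375],
    "hits/t": [20513.0, 20538.0],
    "hits/x": [-9.13, -9.13],
    "hits/y": [-481.74, -481.74],
    "hits/z": [158.82, 158.82],
})
engineer_units(toy)
print(toy["q"].round(2).tolist())  # matches the q column of the table above
```

The loop over `["q", "t", "x", "y", "z"]` replaces five near-identical assignment lines per event with one call.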
With the data read in, we can now explore some initial plots and decide how to approach the analysis. Since we want to classify the flavour of the neutrino based on the optical parameter q as well as the hits per unit time, we need to identify signatures in these two parameters that distinguish the events.
Why are these two items important? These parameters quantify the relative energy of the products created during fundamental particle interactions. According to simulations, the optical parameter has distinct signatures that act like fingerprints for specific neutrino flavours. In other words, one can directly determine the flavour of a neutrino from the unique optical signal it produces. With this in mind, one would instinctively search for these high-energy events in the data (which is shown below).
The first thing we'd like to do is really understand the data and how it is organized. The cell below checks for null values and performs simple descriptive statistics on the data.
#CITE: Homework2 data examination
event = pps.get_icecube_event(all_data,0)
event_data = pd.DataFrame(event)
event_data.info()
print(event_data.describe())
print("The shape of one event of IceCube data (in this case event 0) is ",event_data.shape)
nevents = pps.get_number_of_entries(all_data)
print("There are {0} events in this dataset".format(nevents))
print("The number of sample points for this particular event is ", event_data.size)
mean_hitsq = event_data.iloc[:,2].mean()
median_hitsq = event_data.iloc[:,2].median()
mode_hitsq = event_data.iloc[:,2].mode()
std_hitsq = event_data.iloc[:,2].std()
var_hitsq = event_data.iloc[:,2].var()
print("Some statistics that can be examined for the hits-q (which is a measure of the optical signal) are as follows, ")
print("Mean:",mean_hitsq)
print("Median:",median_hitsq)
print("Mode:",mode_hitsq[0])
print("Standard Deviation:",std_hitsq)
print("Variance:",var_hitsq)
print("___________________________________________")
print("For the df_all_events dataframe...")
print(df_all_events.describe().T)
group=df_all_events.groupby(['event','hits/nhits'])
group.mean()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 844 entries, 0 to 843
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 _SINGLETON_/INDEX 844 non-null int64
1 hits/nhits 844 non-null int64
2 hits/q 844 non-null float32
3 hits/t 844 non-null float32
4 hits/x 844 non-null float32
5 hits/y 844 non-null float32
6 hits/z 844 non-null float32
dtypes: float32(5), int64(2)
memory usage: 29.8 KB
_SINGLETON_/INDEX hits/nhits hits/q hits/t hits/x \
count 844.0 844.0 844.000000 844.000000 844.000000
mean 1.0 844.0 0.983021 13589.759766 17.928196
std 0.0 0.0 1.058903 2464.109131 108.427231
min 1.0 844.0 0.023053 6008.000000 -481.600006
25% 1.0 844.0 0.475000 12582.094727 -10.970000
50% 1.0 844.0 0.875000 13210.000000 31.250000
75% 1.0 844.0 1.175000 13657.000000 72.370003
max 1.0 844.0 14.921871 25479.000000 576.369995
hits/y hits/z
count 844.000000 844.000000
mean -8.074780 -237.688004
std 74.948410 208.913879
min -481.739990 -503.630005
25% -60.470001 -361.149994
50% 6.720000 -290.160004
75% 27.090000 -197.880005
max 463.720001 505.720001
The shape of one event of IceCube data (in this case event 0) is (844, 7)
There are 10 events in this dataset
The number of sample points for this particular event is 5908
Some statistics that can be examined for the hits-q (which is a measure of the optical signal) are as follows,
Mean: 0.9830211400985718
Median: 0.875
Mode: 1.025
Standard Deviation: 1.0589032173156738
Variance: 1.1212760210037231
___________________________________________
For the df_all_events dataframe...
count mean std min \
_SINGLETON_/INDEX 12414.0 1.000000 0.000000 1.000000
hits/nhits 12414.0 1422.352183 488.859518 697.000000
hits/q 12414.0 1.006837 1.271385 0.023053
hits/t 12414.0 13110.463867 2693.843994 5629.000000
hits/x 12414.0 3.634275 147.579620 -570.900024
hits/y 12414.0 -38.538662 169.346558 -521.080017
hits/z 12414.0 -171.131622 244.023712 -510.570007
q 12414.0 2564.506975 3036.542128 11.133595
t 12414.0 0.111840 0.042349 0.025048
x 12414.0 -18.440181 104.314642 -236.473162
y 12414.0 41.530080 91.870028 -138.397433
z 12414.0 -0.343303 28.305936 -242.857133
event 12414.0 5.906235 2.746974 1.000000
25% 50% 75% max
_SINGLETON_/INDEX 1.000000 1.000000 1.000000 1.000000
hits/nhits 844.000000 1502.000000 1799.000000 2159.000000
hits/q 0.525000 0.877201 1.197482 62.603321
hits/t 11940.829102 12807.194824 13465.000000 32831.429688
hits/x -10.970000 35.540001 79.410004 576.369995
hits/y -79.500000 6.720000 35.490002 509.500000
hits/z -350.059998 -227.089996 4.460000 512.950012
q 1018.153601 1629.433904 2789.576106 93652.764133
t 0.068302 0.116846 0.143110 0.380978
x -65.503643 7.368142 19.880650 1262.573071
y -11.443810 8.259871 49.780956 618.624640
z -6.432273 -3.812780 1.659338 449.700610
event 4.000000 6.000000 9.000000 10.000000
| event | hits/nhits | _SINGLETON_/INDEX | hits/q | hits/t | hits/x | hits/y | hits/z | q | t | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 844 | 1.0 | 0.983022 | 13589.749023 | 17.928259 | -8.074822 | -237.687881 | 1610.596205 | 0.063757 | -19.523140 | 42.385751 | -1.935380 |
| 2 | 800 | 1.0 | 0.990026 | 13571.052734 | 19.517662 | -8.888537 | -229.903732 | 1534.719519 | 0.060555 | -15.512202 | 35.838132 | -1.817593 |
| 3 | 917 | 1.0 | 0.928878 | 13113.987305 | -59.272289 | 50.967949 | -226.473145 | 1584.537348 | 0.072948 | -1.330183 | 20.971932 | 1.582592 |
| 4 | 1302 | 1.0 | 0.989285 | 13204.146484 | -44.376842 | 38.416965 | -201.068817 | 2383.310279 | 0.102313 | -19.111217 | 48.753177 | 0.174137 |
| 5 | 2159 | 1.0 | 0.948680 | 13380.733398 | 24.405081 | -7.232395 | -214.480164 | 3936.944966 | 0.165799 | -33.334319 | 81.130253 | -4.030101 |
| 6 | 1560 | 1.0 | 0.943738 | 12960.739258 | -17.750942 | 38.081303 | -180.701965 | 2738.057229 | 0.123935 | -9.371805 | 36.010840 | 0.466757 |
| 7 | 697 | 1.0 | 1.426697 | 11469.996094 | 7.771450 | -394.524750 | 165.350250 | 1135.115307 | 0.062160 | -24.704954 | 2.485091 | 5.097271 |
| 8 | 834 | 1.0 | 0.912425 | 14003.599609 | 75.083023 | -61.713070 | -229.410278 | 1508.803047 | 0.062309 | 7.824406 | 0.688895 | -1.717514 |
| 9 | 1799 | 1.0 | 0.957915 | 13198.725586 | -11.395954 | 33.099060 | -208.950974 | 3315.569639 | 0.140868 | -7.788139 | 55.207786 | 0.172943 |
| 10 | 1502 | 1.0 | 1.157328 | 12438.828125 | 35.930401 | -225.118484 | -48.943596 | 2601.278090 | 0.123837 | -41.699772 | 23.595114 | 1.790371 |
With this data, we can reconstruct the event and make a 3D interactive plot of the detection that occurred. This functionality is supplied by Particle Physics Playground and is shown below.
#CITE Particle Physics Playground introduction to the IceCube experiment notebook
fig = pps.display_icecube_event(event_data)
fig
#CITE: https://python-graph-gallery.com/11-grouped-barplot/
#This visualizes the standard deviation and means of the optical parameter and
#hits per unit time parameter across each of the events
stds = []
means = []
index = []
ind = []
st = []
for i in range(0,10):
    #Loop over all of the events stored in the larger data library
    df_event = pd.DataFrame(pps.get_icecube_event(all_data, i))
    #Engineer the non hit-frequency values (i.e. turn the hits-per-parameter ratios into real units)
    df_event["q"] = df_event["hits/nhits"]/df_event["hits/q"]
    df_event["t"] = df_event["hits/nhits"]/df_event["hits/t"]
    df_event["x"] = df_event["hits/nhits"]/df_event["hits/x"]
    df_event["y"] = df_event["hits/nhits"]/df_event["hits/y"]
    df_event["z"] = df_event["hits/nhits"]/df_event["hits/z"]
    #Take the mean and std of the hits/q ratio and of the engineered optical parameter
    means.append(df_event['hits/q'].mean())
    stds.append(df_event['hits/q'].std())
    ind.append(df_event["q"].mean())
    st.append(df_event["q"].std())
    index.append(i+1)
#CITE AI USAGE - this block simply spaces out the bars of the mean and standard deviation bar plots so that they can become a grouped bar plot.
barWidth = 0.4
r = np.arange(len(means))
r2 = r + barWidth
r3 = r2 + barWidth
###
fig, ax = plt.subplots(dpi=100)
ax.bar(r, means, color='red', width=barWidth, edgecolor='white', label='Optical Signal Mean')
ax.bar(r2, stds, color='blue', width=barWidth, edgecolor='white', label='Optical Signal std')
#Labelling the plot and making sure the labels are large and readable
ax.set_xlabel('Event Number', fontweight='bold', fontsize=15)
ax.set_ylabel('Arbitrary Signal Units', fontweight='bold', fontsize=15)
ax.set_xticks(r + barWidth)
ax.set_xticklabels(index, fontsize=15)
ax.tick_params(axis='y', labelsize=15)
plt.title('Optical Signal Mean and Std for All Events', fontweight='bold', fontsize=15)
ax.legend(fontsize=15)
plt.show()
#This is a repeat of above, but just using the hits/t column data instead
tstds = []
tmeans = []
tindex = []
tind = []
tst = []
for i in range(0,10):
    df_event = pd.DataFrame(pps.get_icecube_event(all_data, i))
    df_event["q"] = df_event["hits/nhits"]/df_event["hits/q"]
    df_event["t"] = df_event["hits/nhits"]/df_event["hits/t"]
    df_event["x"] = df_event["hits/nhits"]/df_event["hits/x"]
    df_event["y"] = df_event["hits/nhits"]/df_event["hits/y"]
    df_event["z"] = df_event["hits/nhits"]/df_event["hits/z"]
    tmeans.append(df_event['hits/t'].mean())
    tstds.append(df_event['hits/t'].std())
    tind.append(df_event["t"].mean())
    tst.append(df_event["t"].std())
    tindex.append(i+1)
#CITE AI USAGE - this block simply spaces out the bars of the mean and standard deviation bar plots so that they can become a grouped bar plot.
barWidth = 0.4
r = np.arange(len(means))
r2 = r + barWidth
r3 = r2 + barWidth
###
fig, ax = plt.subplots(dpi=100)
ax.bar(r, tmeans, color='red', width=barWidth, edgecolor='white', label='Hits Per Time Mean')
ax.bar(r2, tstds, color='blue', width=barWidth, edgecolor='white', label='Hits per Time std')
ax.set_xlabel('Event Number', fontweight='bold', fontsize=14)
ax.set_ylabel('Arbitrary Signal Units', fontweight='bold', fontsize=14)
ax.set_xticks(r + barWidth)
ax.set_xticklabels(tindex, fontsize=14)
ax.tick_params(axis='y', labelsize=14)
plt.title('Hits per Time Mean and Std for All Events', fontweight='bold', fontsize=15)
ax.legend()
plt.show()
These plots show that the mean and standard deviation of the optical parameter spike clearly in events 7 and 10. This may indicate a difference in the neutrino flavour that caused them! Let's examine further and see what else we can identify.
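One simple way to make the "spike" claim quantitative is to flag events whose mean optical signal sits well above the across-event average. A sketch using the per-event hits/q means tabulated above (rounded to two decimals); the half-sigma cut is an illustrative choice, not part of the original analysis:

```python
import numpy as np

# Per-event means of hits/q, rounded from the groupby table above
event_means = np.array([0.98, 0.99, 0.93, 0.99, 0.95, 0.94, 1.43, 0.91, 0.96, 1.16])

mu = event_means.mean()
sigma = event_means.std(ddof=1)
# Flag events sitting more than half a standard deviation above the average
flagged = [i + 1 for i, m in enumerate(event_means) if m > mu + 0.5 * sigma]
print("Events with unusually high mean optical signal:", flagged)
```

With these values the cut singles out events 7 and 10, matching what the bar plot shows by eye.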
This plot visualizes the histogram distributions of the optical parameter across individual events to examine the spread of the optical signal in the IceCube experiment. This is applied both to the hits/q value provided by Particle Physics Playground and to an engineered optical measurement column (q), which I created by dividing nhits by the hits/q ratio as described above.
###CITE IceCube experiment notebook for the histogram plot layout
plt.figure(figsize=(20,8))
for i in range(0,10):
    plt.subplot(2, 5, i + 1)
    event = pps.get_icecube_event(all_data,i)
    df_event = pd.DataFrame(pps.get_icecube_event(all_data, i))
    df_event["q"] = df_event["hits/nhits"]/df_event["hits/q"]
    df_event["t"] = df_event["hits/nhits"]/df_event["hits/t"]
    df_event["x"] = df_event["hits/nhits"]/df_event["hits/x"]
    df_event["y"] = df_event["hits/nhits"]/df_event["hits/y"]
    df_event["z"] = df_event["hits/nhits"]/df_event["hits/z"]
    q = event['hits/q']
    q_mean = np.mean(q)
    q_std = np.std(q)
    #plt.title("Event:"+str(i+1), fontsize=16)
    plt.hist(q,bins=75,label=f"Mean:{q_mean:.2f} Std:{q_std:.2f}")
    plt.xticks(rotation=35, fontsize=15)
    plt.yticks(rotation=35, fontsize=15)
    plt.legend(fontsize=12)
plt.tight_layout(rect=[0.05, 0.05, 0.95, 0.95])
plt.figtext(0.5, -0.03, 'Optical Measurement (q)', ha='center', fontsize=18, fontweight='bold')
plt.figtext(-0.03, 0.5, 'Frequency', va='center', rotation='vertical', fontsize=18, fontweight='bold')
plt.figtext(0.3, 1.05, 'hits/q Optical Parameter Histograms for Events 1-10', va='center', fontsize=25, fontweight='bold')
plt.tight_layout()
###
###CITE IceCube experiment notebook for the histogram plots layout
plt.figure(figsize=(20,8))
for i in range(0,10):
    plt.subplot(2, 5, i + 1)
    event = pps.get_icecube_event(all_data,i)
    df_event = pd.DataFrame(pps.get_icecube_event(all_data, i))
    df_event["q"] = df_event["hits/nhits"]/df_event["hits/q"]
    df_event["t"] = df_event["hits/nhits"]/df_event["hits/t"]
    df_event["x"] = df_event["hits/nhits"]/df_event["hits/x"]
    df_event["y"] = df_event["hits/nhits"]/df_event["hits/y"]
    df_event["z"] = df_event["hits/nhits"]/df_event["hits/z"]
    #plt.title(f"Event: {i+1}", fontsize=16)
    plt.hist(df_event['q'],bins=75,label=f"Mean: {df_event['q'].mean():.2f} Std: {df_event['q'].std():.2f}", color = 'blue');
    plt.xticks(rotation=35, fontsize=13)
    plt.yticks(rotation=35, fontsize=13)
    plt.legend(fontsize=12)
plt.tight_layout(rect=[0.05, 0.05, 0.95, 0.95])
plt.figtext(0.5, -0.03, 'Optical Measurement (q)', ha='center', fontsize=20, fontweight='bold')
plt.figtext(-0.03, 0.5, 'Frequency', va='center', rotation='vertical', fontsize=20, fontweight='bold')
plt.figtext(0.3, 1.05, 'q Optical Parameter Histograms for Events 1-10', va='center', fontsize=25, fontweight='bold')
plt.tight_layout()
###
These plots show scatter plots of the hits/q parameter against other variables such as hits/t and hits/x, y, z. They also include a colour map of the number of detected hits during the event, which is overlaid on the scatter plot.
#AI assistance: initializing figure and creating a subplot
fig2=plt.figure(2)
ax=fig2.add_subplot(111)
###
choice = [3,7] #Choice of the events that I wish to look at
for i in choice:
    #Make a scatter of hits/q as a function of hits/y
    df_event = pd.DataFrame(pps.get_icecube_event(all_data, i))
    plt.scatter(df_event['hits/y'], df_event['hits/q'], marker='o', s=40, c=df_event['hits/t'], label=f'Event {i}', alpha=0.8)
cbar=plt.colorbar()
ax.set_xlabel(r'$\mathrm{hits/y}$',fontsize=18)
ax.set_ylabel(r'$\mathrm{hits/q}$',fontsize=18)
cbar.set_label(r'$\mathrm{hits/t}$',fontsize=18)
#ax.set_ylim(0, 2.50) # **Set y-axis limits**chatgpt
plt.legend()
plt.title('Colour Map of hits/t Overlaid on Scatter of hits/q as a function of hits/y')
plt.show()
#Same plot as above, just zoomed in
fig2=plt.figure(2)
ax=fig2.add_subplot(111)
choice = [3,7] #Choice of the events I want to look at
for i in choice:
    #Plot hits/q as a function of hits/y
    df_event = pd.DataFrame(pps.get_icecube_event(all_data, i))
    plt.scatter(df_event['hits/y'], df_event['hits/q'], marker='o', s=40, c=df_event['hits/t'], label=f'Event {i}', alpha=0.8)
cbar=plt.colorbar()
ax.set_xlabel(r'$\mathrm{hits/y}$',fontsize=18)
ax.set_ylabel(r'$\mathrm{hits/q}$',fontsize=18)
cbar.set_label(r'$\mathrm{hits/t}$',fontsize=18)
ax.set_ylim(0, 2.50) # **Set y-axis limits**chatgpt
plt.legend()
plt.title('Colour Map of hits/t Overlaid on Scatter of hits/q as a function of hits/y (Zoomed in)')
plt.show()
#################################################
fig2=plt.figure(2)
ax=fig2.add_subplot(111)
choice = [3,7]
for i in choice:
    df_event = pd.DataFrame(pps.get_icecube_event(all_data, i))
    df_event["q"] = df_event["hits/nhits"]/df_event["hits/q"]
    df_event["t"] = df_event["hits/nhits"]/df_event["hits/t"]
    df_event["x"] = df_event["hits/nhits"]/df_event["hits/x"]
    df_event["y"] = df_event["hits/nhits"]/df_event["hits/y"]
    df_event["z"] = df_event["hits/nhits"]/df_event["hits/z"]
    plt.scatter(df_event['y'], df_event['q'],marker='o',s=40,c=df_event['t'],label=f'Event {i}', alpha=0.8)
cbar=plt.colorbar()
ax.set_xlabel(r'$\mathrm{y}$',fontsize=18)
ax.set_ylabel(r'$\mathrm{q}$',fontsize=18)
cbar.set_label(r'$\mathrm{t}$',fontsize=18)
#ax.set_ylim(0, 2.50) # **Set y-axis limits** chatgpt
plt.legend()
plt.title('Colour Map of t Overlaid on Scatter of q as a function of y')
plt.show()
fig2=plt.figure(2)
ax=fig2.add_subplot(111)
choice = [3,7]
for i in choice:
    df_event = pd.DataFrame(pps.get_icecube_event(all_data, i))
    df_event["q"] = df_event["hits/nhits"]/df_event["hits/q"]
    df_event["t"] = df_event["hits/nhits"]/df_event["hits/t"]
    df_event["x"] = df_event["hits/nhits"]/df_event["hits/x"]
    df_event["y"] = df_event["hits/nhits"]/df_event["hits/y"]
    df_event["z"] = df_event["hits/nhits"]/df_event["hits/z"]
    plt.scatter(df_event['y'], df_event['q'], marker='o', s=40, c=df_event['t'],label=f'Event {i}', alpha=0.8)
cbar=plt.colorbar()
ax.set_xlabel(r'$\mathrm{y}$',fontsize=18)
ax.set_ylabel(r'$\mathrm{q}$',fontsize=18)
cbar.set_label(r'$\mathrm{t}$',fontsize=18)
plt.legend()
ax.set_xlim(-75, 75) # **Set x-axis limits** chatgpt
ax.set_ylim(0, 10000) # **Set y-axis limits** chatgpt
plt.title('Colour Map of t Overlaid on Scatter of q as a function of y (Zoomed in)')
plt.show()
The plots above compare events 3 and 7 to highlight differences in the optical parameter and the hits/t parameter. There is a large density of points with some extreme outliers; zooming into the high-density regions reveals some natural clustering of the points.
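The apparent clustering could be probed directly with k-means, which the notebook already imports. A minimal sketch on synthetic blobs standing in for the zoomed scatter (the cluster centers and counts below are illustrative assumptions, not the real hit data), scoring candidate values of k with the silhouette coefficient:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D data with three well-separated groups, standing in for the
# high-density (hits/t, hits/q) region; centers are arbitrary illustrative picks
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 6], [0, 8]],
                  cluster_std=0.8, random_state=42)

best_k, best_score = None, -1.0
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    score = silhouette_score(X, labels)   # higher = tighter, better-separated clusters
    if score > best_score:
        best_k, best_score = k, score
print(f"best k by silhouette: {best_k} (score {best_score:.2f})")
```

On the real data, X would be the two plotted columns (ideally standardized first, since hits/t and hits/q live on very different scales).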
#Same functional layout of the code as the plots above, they just look at different plotted parameters
fig2=plt.figure(2)
ax=fig2.add_subplot(111)
choice = [3,7]
for i in choice:
    df_event = pd.DataFrame(pps.get_icecube_event(all_data, i))
    plt.scatter(df_event['hits/t'], df_event['hits/q'], marker='o', s=40, c=df_event['hits/nhits'], label=f'Event {i}', alpha=0.8)
cbar=plt.colorbar()
plt.legend()
ax.set_xlabel(r'$\mathrm{hits/t}$',fontsize=18)
ax.set_ylabel(r'$\mathrm{hits/q}$',fontsize=18)
cbar.set_label(r'$\mathrm{hits/nhits}$',fontsize=18)
#ax.set_ylim(0, 2.50)
plt.legend()
plt.title('Colour Map of hits/nhits Overlaid on Scatter of hits/q as a function of hits/t')
plt.show()
fig2=plt.figure(2)
ax=fig2.add_subplot(111)
choice = [3,7]
for i in choice:
    df_event = pd.DataFrame(pps.get_icecube_event(all_data, i))
    plt.scatter(df_event['hits/t'], df_event['hits/q'], marker='o', s=40, c=df_event['hits/nhits'], label=f'Event {i}', alpha=0.8)
cbar=plt.colorbar()
plt.legend()
ax.set_xlabel(r'$\mathrm{hits/t}$',fontsize=18)
ax.set_ylabel(r'$\mathrm{hits/q}$',fontsize=18)
cbar.set_label(r'$\mathrm{hits/nhits}$',fontsize=18)
ax.set_ylim(0, 2.50)
plt.legend()
plt.title('Colour Map of hits/nhits Overlaid on Scatter of hits/q as a function of hits/t (ZOOMED IN)')
plt.show()
fig2=plt.figure(2)
ax=fig2.add_subplot(111)
choice = [3,7]
for i in choice:
    df_event = pd.DataFrame(pps.get_icecube_event(all_data, i))
    df_event["q"] = df_event["hits/nhits"]/df_event["hits/q"]
    df_event["t"] = df_event["hits/nhits"]/df_event["hits/t"]
    df_event["x"] = df_event["hits/nhits"]/df_event["hits/x"]
    df_event["y"] = df_event["hits/nhits"]/df_event["hits/y"]
    df_event["z"] = df_event["hits/nhits"]/df_event["hits/z"]
    plt.scatter(df_event['t'], df_event['q'],marker='o',s=40,c=df_event['hits/nhits'],label=f'Event {i}', alpha=0.8)
cbar=plt.colorbar()
ax.set_xlabel(r'$\mathrm{t}$',fontsize=18)
ax.set_ylabel(r'$\mathrm{q}$',fontsize=18)
cbar.set_label(r'$\mathrm{nhits}$',fontsize=18)
#ax.set_ylim(0, 2.50)
plt.legend()
plt.title('Colour Map of nhits Overlaid on Scatter of q as a function of t')
plt.show()
fig2=plt.figure(2)
ax=fig2.add_subplot(111)
choice = [3,7]
for i in choice:
    df_event = pd.DataFrame(pps.get_icecube_event(all_data, i))
    df_event["q"] = df_event["hits/nhits"]/df_event["hits/q"]
    df_event["t"] = df_event["hits/nhits"]/df_event["hits/t"]
    df_event["x"] = df_event["hits/nhits"]/df_event["hits/x"]
    df_event["y"] = df_event["hits/nhits"]/df_event["hits/y"]
    df_event["z"] = df_event["hits/nhits"]/df_event["hits/z"]
    plt.scatter(df_event['t'], df_event['q'], marker='o', s=40, c=df_event['hits/nhits'],label=f'Event {i}', alpha=0.8)
cbar=plt.colorbar()
ax.set_xlabel(r'$\mathrm{t}$',fontsize=18)
ax.set_ylabel(r'$\mathrm{q}$',fontsize=18)
cbar.set_label(r'$\mathrm{nhits}$',fontsize=18)
plt.legend()
ax.set_xlim(0, 0.15)
ax.set_ylim(0, 12000)
plt.title('Colour Map of nhits Overlaid on Scatter of q as a function of t (Zoomed in)')
plt.show()
We can also look at box plots and violin plots, which reflect the number of outliers in the data. These reveal an abundance of extreme values in the optical parameter that may indicate high-energy interactions within the ice. This will be a point of examination.
fig1 = plt.figure(figsize=(9,4)) #initialize the plot
ax = fig1.gca()
#Make a boxplot of the hits/q data to examine the abundance of outliers and the distribution of optical signal values across events
ax = sns.boxplot(x="event", y="hits/q", data=df_all_events, notch=True, saturation=0.5, medianprops=dict(color='red'), linewidth=2) #CITE: Chatgpt: Changing the median line inside the interquartile box so it is more noticeable
ax.set_title('Hits/q Boxplot', fontweight='bold', fontsize=20)
ax.set_xlabel('Event Number', fontweight='bold', fontsize=20)
ax.set_ylabel('Arbitrary Signal Units', fontweight='bold', fontsize=20)
ax.tick_params(axis='x', labelsize=18)
ax.tick_params(axis='y', labelsize=18)
plt.show()
#Make a violin plot of the hits/q data to examine the abundance of outliers and the distribution of optical signal values across events
fign = plt.figure(figsize=(9,4))
ax = fign.gca()
sns.violinplot(x="event", y="hits/q", data=df_all_events, color="skyblue", inner="quartile") #CITE: AI Assistance: Creating a violin plot to similarly showcase the abundance of outlier values; very similar to the box plot
ax.set_title("Hits/q Violin Plot", fontweight='bold', fontsize=16)
ax.set_xlabel('Event Number', fontweight='bold', fontsize=14)
ax.set_ylabel('Arbitrary Signal Units', fontweight='bold', fontsize=14)
ax.tick_params(axis='y', labelsize=14)
ax.tick_params(axis='x', labelsize=14)
plt.show()
We will now try to visualize the correlation between the parameters in the dataset using the Pearson correlation coefficient, given by $$ r = \mathrm{cor}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_X \, \sigma_Y}. $$ This quantifies how linearly related the different parameters of the dataset are to one another. We will plot it as a correlation heatmap and then tighten the scope of the analysis to only the more relevant parameters.
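The formula can be checked directly against NumPy's built-in correlation routine; a quick sketch on two small illustrative arrays (the values are made up for the demonstration):

```python
import numpy as np

# Two small illustrative arrays with a roughly linear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Pearson r straight from the definition: cov(X, Y) / (std(X) * std(Y))
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
r_manual = cov_xy / (x.std() * y.std())

# Cross-check against NumPy's correlation matrix
r_numpy = np.corrcoef(x, y)[0, 1]
print(f"manual r = {r_manual:.4f}, numpy r = {r_numpy:.4f}")
```

`DataFrame.corr()` used below computes exactly this quantity for every pair of columns at once.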
corr = df_all_events.corr() #Calculates the correlation of the variables in the data set
f, ax = plt.subplots(figsize=(8, 8))
#Visualizes it using the seaborn library heatmap plot tool
ax = sns.heatmap(data=corr, annot=True, fmt=".2f", square=True, ax=ax)
plt.title('Correlation Heatmap',size=16)
ax.set_xticklabels(ax.xaxis.get_ticklabels(), fontsize=10, rotation=45)
ax.set_yticklabels(ax.yaxis.get_ticklabels(), fontsize=10, rotation=0)
plt.show()
params = ['hits/q', 'hits/t', 'hits/x', 'hits/y', 'hits/z'] #reduces the scope of the plot to the most important and defining variables
corr = df_all_events[params].corr()
f, ax = plt.subplots(figsize=(8, 8))
sns.set(font_scale=1.5)
ax = sns.heatmap(data=corr, annot=True, fmt=".2f", square=True, ax=ax)
plt.title('Correlation Matrix Heatmap',size=25)
ax.set_xticklabels(ax.xaxis.get_ticklabels(), fontsize=20, rotation=45)
ax.set_yticklabels(ax.yaxis.get_ticklabels(), fontsize=20, rotation=0)
plt.show()
What this shows is that most of the data is not strongly linearly correlated. However, some pairs of variables show a notable negative (anti-)correlation: hits/x and hits/z, hits/t and hits/z, and hits/y and hits/z. Let's quickly visualize hits/x against hits/z to show this relationship.
#More exploratory plots, with the same functional form as above.
fig2=plt.figure(2)
ax=fig2.add_subplot(111)
plt.scatter(df_all_events['hits/z'],df_all_events['hits/x'],marker='o',s=40,c=df_all_events['hits/nhits'],alpha=0.5) #Visualized using a heatmap
cbar=plt.colorbar()
ax.set_xlabel(r'$\mathrm{hits/z}$',fontsize=18)
ax.set_ylabel(r'$\mathrm{hits/x}$',fontsize=18)
cbar.set_label(r'$\mathrm{hits/nhits}$',fontsize=18)
#ax.set_ylim(0, 2.50)
#ax.set_xlim(4200, 15000)
plt.title('Colour Map of hits/nhits Overlaid on Scatter of hits/x as a function of hits/z')
plt.show()
These plots are similar to the ones above; however, they plot ALL of the events on one figure and zoom into the high-density features of the data. They show clusters at different hits/t values, indicating fast-moving particles that trigger many detectors per unit time. This may also be an indication of a high-energy event.
#More exploratory plots using the same functional form as above
fig2=plt.figure(2)
ax=fig2.add_subplot(111)
#This is zoomed in to examine the separation of cluster groups
plt.scatter(df_all_events['hits/t'],df_all_events['hits/q'],marker='o',s=40,c=df_all_events['hits/nhits'],alpha=0.5)
cbar=plt.colorbar()
ax.set_xlabel(r'$\mathrm{hits/t}$',fontsize=18)
ax.set_ylabel(r'$\mathrm{hits/q}$',fontsize=18)
cbar.set_label(r'$\mathrm{hits/nhits}$',fontsize=18)
ax.set_ylim(0, 2.50)
ax.set_xlim(4200, 15000)
plt.title('Colour Map of hits/nhits Overlaid on Scatter of hits/q as a function of hits/t (ZOOMED IN)')
plt.show()
fig2=plt.figure(2)
ax=fig2.add_subplot(111)
#All of the events (0-9) optical data is plotted to observe differences in the optical intensity for all different classes of neutrinos together
for i in range(0,10):
    event_df = pd.DataFrame(pps.get_icecube_event(all_data, i)) #Build the event frame once per iteration
    plt.scatter(event_df['hits/t'], event_df['hits/q'],marker='o',s=40,c=event_df['hits/nhits'],label=f'Event {i}', alpha=0.5)
cbar=plt.colorbar()
ax.set_xlabel(r'$\mathrm{hits/t}$',fontsize=18)
ax.set_ylabel(r'$\mathrm{hits/q}$',fontsize=18)
cbar.set_label(r'$\mathrm{hits/nhits}$',fontsize=18)
ax.set_ylim(0, 10)
plt.legend()
plt.title('Colour Map of hits/nhits Overlaid on Scatter of hits/q as a function of hits/t (All Events, Zoomed In)')
plt.show()
After examining this evidence for classifying neutrinos based on the optical parameter and hits/t, I am most interested in the natural clustering in the hits/t parameter, which indicates very fast-moving particles that trigger many detectors. Below, therefore, are the steps to use a Random Forest regression algorithm to predict hits/q from the other parameters using a 70/30 train/test split.
First, I will maximize the cross-validation accuracy as a function of the number of trees used.
#Takes the data and makes a target and feature matrix
Xkss = df_all_events[[c for c in df_all_events.columns if c not in ['q', 'hits/q', '_SINGLETON_/INDEX', 'event']]]
ykss = df_all_events['hits/q'].values
# Standardize features!!! Very important so that quantities on different scales are comparable
ss = StandardScaler()
Xsss = ss.fit_transform(Xkss)
n_trees = range(10, 100, 10)
r2_scores = []
for n in n_trees:
    rf = RandomForestRegressor(n_estimators=n, random_state=42) #CITE CHATGPT: Initializes a random forest regression model with n trees on each pass of the loop
    scores = cross_val_score(rf, Xsss, ykss, cv=10, scoring='r2') #10-fold cross-validated R2 for this tree count
    r2_scores.append(np.mean(scores))
plt.figure(figsize=(8, 5))
plt.plot(n_trees, r2_scores)
plt.xlabel("Number of Trees")
plt.ylabel("Cross-Validation R2 Score")
plt.title("Random Forest Regression Performance vs. Number of Trees")
plt.show()
print(f"Best R2 Score: {max(r2_scores):.4f} at {n_trees[np.argmax(r2_scores)]} trees")
Best R2 Score: 0.6920 at 80 trees
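The same sweep can also be expressed with scikit-learn's `GridSearchCV`, which cross-validates each candidate tree count and keeps the best one. A minimal sketch, using `make_regression` as a synthetic stand-in for the scaled `Xsss`/`ykss` arrays:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; in the notebook the scaled Xsss and ykss would be used
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [10, 40, 80]},  # candidate tree counts
    cv=5,
    scoring="r2",
)
grid.fit(X, y)
print(grid.best_params_["n_estimators"], round(grid.best_score_, 3))
```

`grid.best_estimator_` is then already refit on the full data, which removes the manual "take the best tree count forward" step done below.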
# make a 70/30 train test split: NOTE Xsss is already scaled using the standard scaler - a very important part!
X_train, X_test, y_train, y_test = train_test_split(Xsss, ykss, test_size=0.3, random_state=42)
best_n_trees = 80 #Takes the best tree number from the previous part to optimize the regression model
rf_best = RandomForestRegressor(n_estimators=best_n_trees, random_state=42) #CITE: GPT: Creates the random forest regression model used
rf_best.fit(X_train, y_train) #Fitting to the data using the model created with the optimized tree number
y_pred = rf_best.predict(X_test) #Prediction of the hits/q parameter using the model trained
score = rf_best.score(X_test, y_test) #scoring the model on one fold of data with r2
score_CV = cross_val_score(rf_best, Xsss, ykss, cv=10, scoring='r2') #Using a cross validation score across 10 folds
r2 = r2_score(y_test, y_pred) #scoring the model on one fold of data with r2
mse = mean_squared_error(y_test, y_pred) #Mean squared error to compute the average squared residual value
#Plot the predicted fit vs the actual data
plt.figure(figsize=(7, 5))
plt.scatter(y_test, y_pred, alpha=0.6, color='blue', label="Predicted vs Actual")
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], '--r', label="Perfect Fit") #CITE: GPT: Just makes the diagonal line which represents the perfect fit
plt.xlabel("Actual 'hits/q'", fontweight='bold', fontsize=20)
plt.ylabel("Predicted 'hits/q'", fontweight='bold', fontsize=20)
plt.title(f"Actual vs. Predicted (Best RF Model): {best_n_trees} Trees)", fontweight='bold', fontsize=22)
plt.legend()
plt.show()
print(f"The r2 score for this fit is {score:.2} and has a Mean Squared Error of {mse:.2}")
print(f"The cross value score is {np.mean(score_CV):.2}")
The r2 score for this fit is 0.42 and has a Mean Squared Error of 0.6 The cross value score is 0.69
This section now uses K-means clustering to group the data into 3 distinct groups corresponding to the different neutrino flavours that caused them. As explained in the paper, the group 0 clusters correspond to the high-energy muon neutrinos, group 1 represents the tau neutrino group, and group 2 is the electron neutrino group. This labelling is consistent in both K-means clustering tasks.
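To illustrate why seeding KMeans with hand-picked centroids keeps the cluster numbering stable (and hence lets us assert a fixed group-to-flavour mapping), here is a minimal sketch on toy 2-D blobs; the `flavour_map` names follow the labelling asserted above, and the data values are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans

# Three obvious toy blobs (hypothetical values, not IceCube coordinates)
rng = np.random.default_rng(0)
pts = np.vstack([
    rng.normal([0, 0], 0.1, (20, 2)),
    rng.normal([5, 0], 0.1, (20, 2)),
    rng.normal([0, 5], 0.1, (20, 2)),
])

# Explicit init pins the label order: label 0 is the blob near (0, 0),
# label 1 near (5, 0), label 2 near (0, 5)
init = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
km = KMeans(n_clusters=3, init=init, n_init=1).fit(pts)

flavour_map = {0: "muon", 1: "tau", 2: "electron"}  # mapping asserted in the text
print([flavour_map[k] for k in km.labels_[:3]])     # first blob -> all "muon"
```

With the default random init, the numbering of the clusters (and hence the flavour labels) would not be reproducible between runs.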
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning, module="sklearn")
#CITE PLOT DESIGN FROM: https://seaborn.pydata.org/generated/seaborn.jointplot.html
initial_centroids = np.array([[10700, 0.99], [13000, 1.2], [20000, 1.2]]) #Specifying the location of the centroids based on the EDA above
dfX=pd.DataFrame({'hits/t': df_all_events['hits/t'], 'hits/q': df_all_events['hits/q']}) #Make a new dataframe to be plotted
# Drop rows with NaN values in either 'hits/t' or 'hits/q' columns
#dfX = dfX.dropna(subset=['hits/t', 'hits/q'])
model = KMeans(n_clusters=3, init=initial_centroids, random_state=42).fit(dfX) #Creates the kmeans clustering with three clusters centred as the location of the centroids
#Predicts the location of the clusters
predicted = model.labels_
centroids = model.cluster_centers_
dfX['predicted'] = predicted
print("Location of centroids: ")
print(centroids)
#Plots the clusters using the seaborn jointplot style, coloured by the predicted cluster label
plts = sns.jointplot(data=df_all_events, x='hits/t', y='hits/q', hue=dfX['predicted'], palette='Set1')
plts.ax_joint.set_xlabel('hits/t', fontsize=22, fontweight='bold')
plts.ax_joint.set_ylabel('hits/q', fontsize=22, fontweight='bold')
plts.ax_joint.tick_params(axis='both', labelsize=18)
plt.figtext(0.6, 1, 'K-Means Clustering Jointplot Using hits/q as a Function of hits/t', ha='center', fontsize=20, fontweight='bold')
plt.yscale('log')
plt.show()
Location of centroids: [[1.0676513e+04 1.1690106e+00] [1.3158599e+04 9.4351619e-01] [2.1319223e+04 1.0510709e+00]]
score = silhouette_score(dfX, predicted, metric='euclidean') #how discernible the groups are
score
0.6525506774802826
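For context on reading that number: silhouette scores run from -1 to 1, with values near 1 meaning tight, well-separated clusters. A minimal sketch on hypothetical two-blob data shows both ends of the scale:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def blob_silhouette(spread, seed=1):
    # Two 2-D Gaussian blobs with the given spread (hypothetical data)
    rng = np.random.default_rng(seed)
    pts = np.vstack([rng.normal(0, spread, (30, 2)),
                     rng.normal(5, spread, (30, 2))])
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(pts)
    return silhouette_score(pts, labels)

s_tight = blob_silhouette(0.05)  # well-separated blobs: score near 1
s_loose = blob_silhouette(2.0)   # overlapping blobs: much lower score
print(round(s_tight, 2), round(s_loose, 2))
```

On that scale, the 0.65 above suggests reasonably well-separated (though not perfectly clean) groups.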
I also do this for the hits/z and hits/q parameters, since one might expect higher-energy muon neutrinos to penetrate deeper into the ice than background electron and tau neutrinos.
#CITE PLOT DESIGN FROM: https://seaborn.pydata.org/generated/seaborn.jointplot.html
dfXx=pd.DataFrame({'hits/z': df_all_events['hits/z'], 'hits/q': df_all_events['hits/q']})
# Drop rows with NaN values in either 'hits/z' or 'hits/q' columns
#dfXx = dfXx.dropna(subset=['hits/z', 'hits/q'])
modelz = KMeans(n_clusters=3, init=initial_centroids, random_state=42).fit(dfXx) #Note: this reuses the centroids chosen for the hits/t space; KMeans still converges to centroids in the hits/z range
#Same functional form as above!
predictedz = modelz.labels_
centroidsz = modelz.cluster_centers_
dfXx['predicted'] = predictedz
print("Location of centroids: ")
print(centroidsz)
#Same jointplot visualization as above, coloured by the predicted cluster label
plts = sns.jointplot(data=df_all_events, x='hits/z', y='hits/q', hue=dfXx['predicted'], palette='Set1')
plts.ax_joint.set_xlabel('hits/z', fontsize=22, fontweight='bold')
plts.ax_joint.set_ylabel('hits/q (log scale)', fontsize=22, fontweight='bold')
plts.ax_joint.tick_params(axis='both', labelsize=18)
plt.figtext(0.6, 1, 'K-Means Clustering Jointplot Using hits/q as a Function of hits/z', ha='center', fontsize=20, fontweight='bold')
plts.ax_joint.legend(title='Cluster', fontsize=12, title_fontsize=13) #CITE CHATGPT: Makes the legend smaller so it is not blocking points
plt.yscale('log')
plt.show()
Location of centroids: [[-184.30653 1.0682168] [-376.03217 0.8831614] [ 203.79977 1.1407628]]
score = silhouette_score(dfXx, predictedz, metric='euclidean') #how discernible the groups are
score
0.3480709722957436
Separating the large dataset into groups based on the K-means clustering allows us to create 3 subsets of the data, each of which can have a random forest regression run on it. In this way, not only have we separated out the data presumably corresponding to each neutrino flavour, but we are also able to predict how each flavour will behave. This is a step up from the previous regression, which simply tried to predict several different physical phenomena all at once.
This first regression uses the hits/t vs hits/q clustering results to create subsets, each of which is then fit with a random forest regression.
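The per-cluster fitting pattern used below can be sketched compactly with a `groupby` over the cluster label; the frame and column names here are hypothetical stand-ins for `df_all_events`:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical frame standing in for df_all_events with a 'cluster' column
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "x1": rng.normal(size=90),
    "x2": rng.normal(size=90),
    "cluster": np.repeat([0, 1, 2], 30),
})
# Each cluster gets its own (different) relationship to the target
df["target"] = df["x1"] * (df["cluster"] + 1) + rng.normal(scale=0.1, size=90)

models = {}
for label, sub in df.groupby("cluster"):   # one regressor per cluster subset
    rf = RandomForestRegressor(n_estimators=20, random_state=42)
    rf.fit(sub[["x1", "x2"]], sub["target"])
    models[label] = rf

print(sorted(models))  # one fitted model per cluster id: [0, 1, 2]
```

Each per-cluster model only ever sees data from its own group, which is exactly the behaviour the explicit loop below implements.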
#Same functional form as above!!!: only difference is that it loops over the subsets defined by the clusters
#these subsets correspond to the different neutrino flavours
df_all_events['cluster'] = predicted
# Makes subsets of the data based on how the k means clustering predicted the groups
cluster_A = df_all_events[df_all_events['cluster']==0]
cluster_B = df_all_events[df_all_events['cluster']==1]
cluster_C = df_all_events[df_all_events['cluster']==2]
groups = [cluster_A, cluster_B, cluster_C]
l=0
#loops over the clusters 0,1 and 2 corresponding to the different flavours determined by kmeans
for group in groups:
    Xkss = group[[c for c in group.columns if c not in ['q', 'hits/q', '_SINGLETON_/INDEX', 'event']]]
    ykss = group['hits/q'].values
    # Standardize features!!! Very important so that quantities on different scales are comparable
    ss = StandardScaler()
    Xsss = ss.fit_transform(Xkss)
    #To prevent overfitting and poor performance, the first group (i.e. group 0) uses a 40/60 train/test split
    #other groups use 70/30
    if l == 0:
        size = 0.6
    else:
        size = 0.3
    # make the train test split: NOTE Xsss is already scaled using the standard scaler - a very important part!
    X_train, X_test, y_train, y_test = train_test_split(Xsss, ykss, test_size=size, random_state=42)
    best_n_trees = 80
    rf_best = RandomForestRegressor(n_estimators=best_n_trees, random_state=42) #CITE: GPT: instantiates the random forest regression model used
    rf_best.fit(X_train, y_train)
    y_pred = rf_best.predict(X_test)
    score = rf_best.score(X_test, y_test)
    score_CV = cross_val_score(rf_best, Xsss, ykss, cv=10, scoring='r2')
    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    plt.figure(figsize=(7, 5))
    plt.scatter(y_test, y_pred, alpha=0.6, color='blue', label="Predicted vs Actual")
    plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], '--r', label="Perfect Fit") #CITE: GPT: Just makes the diagonal line which represents the perfect fit
    plt.xlabel("Actual 'hits/q'", fontweight='bold', fontsize=20)
    plt.ylabel("Predicted 'hits/q'", fontweight='bold', fontsize=20)
    plt.title(f"Actual vs. Predicted Group {l}: {best_n_trees} Trees", fontweight='bold', fontsize=22)
    plt.legend()
    plt.tight_layout()
    plt.show()
    print(f"The r2 score for this fit is {score:.2} and has a Mean Squared Error of {mse:.2}")
    print(f"The cross value score is {np.mean(score_CV):.2}")
    print(f"The mean optical signal for this group is {np.mean(y_pred)}")
    l = l + 1
The r2 score for this fit is 0.23 and has a Mean Squared Error of 2.9 The cross value score is 0.54 The mean optical signal for this group is 1.1848178403079106
The r2 score for this fit is 0.46 and has a Mean Squared Error of 0.37 The cross value score is 0.76 The mean optical signal for this group is 0.9386202857034965
The r2 score for this fit is 0.41 and has a Mean Squared Error of 0.47 The cross value score is 0.59 The mean optical signal for this group is 1.069292957918956
This second regression uses the hits/z vs hits/q clustering algorithm results to create subsets, each of which are then predicted using a random forest algorithm
#Same functional form as above, just using the hits/z cluster labels instead
df_all_events['cluster'] = predictedz
cluster_A = df_all_events[df_all_events['cluster']==0]
cluster_B = df_all_events[df_all_events['cluster']==1]
cluster_C = df_all_events[df_all_events['cluster']==2]
groups = [cluster_A, cluster_B, cluster_C]
l=0
for group in groups:
    Xkss = group[[c for c in group.columns if c not in ['q', 'hits/q', '_SINGLETON_/INDEX', 'event']]]
    ykss = group['hits/q'].values
    # Standardize features!!! Very important so that quantities on different scales are comparable
    ss = StandardScaler()
    Xsss = ss.fit_transform(Xkss)
    if l == 0:
        size = 0.6
    else:
        size = 0.3
    # make the train test split: NOTE Xsss is already scaled using the standard scaler - a very important part!
    X_train, X_test, y_train, y_test = train_test_split(Xsss, ykss, test_size=size, random_state=42)
    best_n_trees = 80
    rf_best = RandomForestRegressor(n_estimators=best_n_trees, random_state=42) #CITE: GPT: Creates the model used
    rf_best.fit(X_train, y_train)
    y_pred = rf_best.predict(X_test)
    score = rf_best.score(X_test, y_test)
    score_CV = cross_val_score(rf_best, Xsss, ykss, cv=10, scoring='r2')
    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    plt.figure(figsize=(7, 5))
    plt.scatter(y_test, y_pred, alpha=0.6, color='blue', label="Predicted vs Actual")
    plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], '--r', label="Perfect Fit") #CITE: GPT: Just makes the diagonal line which represents the perfect fit
    plt.xlabel("Actual 'hits/q'", fontsize=20, fontweight='bold')
    plt.ylabel("Predicted 'hits/q'", fontsize=20, fontweight='bold')
    plt.title(f"Actual vs. Predicted Group {l}: {best_n_trees} Trees", fontsize=22, fontweight='bold')
    plt.legend(fontsize=15)
    plt.show()
    print(f"The r2 score for this fit is {score:.2} and has a Mean Squared Error of {mse:.2}")
    print(f"The cross value score is {np.mean(score_CV):.2}")
    print(f"The mean optical signal for this group is {np.mean(y_pred)}")
    l = l + 1
The r2 score for this fit is 0.3 and has a Mean Squared Error of 0.96 The cross value score is 0.85 The mean optical signal for this group is 1.077547095455038
The r2 score for this fit is 0.53 and has a Mean Squared Error of 0.22 The cross value score is 0.7 The mean optical signal for this group is 0.8952648581447339
The r2 score for this fit is 0.39 and has a Mean Squared Error of 3.5 The cross value score is 0.38 The mean optical signal for this group is 1.1847571540733952
Since we have used K-means clustering twice with two different independent-variable parameters, and have formed 3 clusters in each, we have created pseudo-labels for our data. In other words, it would seem that astrophysical muon neutrinos with high energy (i.e. high hits/q values) should fall in group 0 of the hits/t clustering as well as group 0 of the hits/z clustering. Thus, we can compare how often an index classified in group 0 for the hits/t clustering also lands in group 0 for the hits/z clustering and see if our hypothesis holds.
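A single-number complement to a confusion matrix here is the adjusted Rand index, which scores agreement between two labellings while ignoring the arbitrary numbering of the clusters; a sketch with hypothetical label vectors standing in for cluster_t and cluster_z:

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical label vectors standing in for the hits/t and hits/z clusterings
cluster_t = [0, 0, 0, 1, 1, 2, 2, 2]
cluster_z = [0, 0, 0, 1, 1, 1, 2, 2]   # one point assigned differently

ari = adjusted_rand_score(cluster_t, cluster_z)
print(round(ari, 2))  # close to 1 for mostly-agreeing labellings
```

Unlike a raw confusion matrix, the index would be unchanged if one clustering happened to permute its group numbers.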
#CITE CHART DESIGN: https://medium.com/@dtuk81/confusion-matrix-visualization-fc31e3f30fea
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
#Link the cluster group labels from the hits/t and hits/z clusterings with their respective points in the raw data set
df_all_events['cluster_t'] = predicted
df_all_events['cluster_z'] = predictedz
#Compare how often group association matches between the two kmeans clusterings via a confusion matrix
conf_mat = confusion_matrix(df_all_events['cluster_t'], df_all_events['cluster_z'])
sns.set(font_scale=1.5) #CITE CHATGPT: makes the font bigger
#This is visualized using the chart design from the link above
sns.heatmap(conf_mat/np.sum(conf_mat), annot=True,
            fmt='.2%', cmap='Blues')
plt.title("Confusion Matrix Between hits/t and hits/z Classifier", fontweight='bold')
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.show()
While group 2 does not match as accurately, our predictions for group 0 are quite accurate and lead to few false guesses, which is promising for our method! The point of the confusion matrix was not to validate the accuracy of groups 1 and 2, since electron and tau neutrino signals are so similar, but the fact that our expectations for the muon neutrino observations coincide between the hits/z and hits/t clusterings is interesting.
[1] Confusion Matrix Visualization: https://medium.com/@dtuk81/confusion-matrix-visualization-fc31e3f30fea
[2] Particle Physics Playground: https://sites.google.com/siena.edu/particle-physics-playground/icecube-introduction?authuser=0
[3] Jointplot Documentation: https://seaborn.pydata.org/generated/seaborn.jointplot.html
[4] Waskom, M., Botvinnik, O., O'Kane, D., Hobson, P., Lukauskas, S., Gemperline, D. C., … Qalieh, A. (2017). mwaskom/seaborn: v0.8.1 (September 2017). Zenodo. https://doi.org/10.5281/zenodo.883859
[5] Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., … Oliphant, T. E. (2020). Array programming with NumPy. Nature, 585(7825), 357–362. https://doi.org/10.1038/s41586-020-2649-2
[6] McKinney, W., & others. (2010). Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference (Vol. 445, pp. 51–56).
[7] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., … others. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct), 2825–2830.
[8] ChatGPT (OpenAI)
[9] Waskom, M. L. (2021). seaborn: statistical data visualization. Journal of Open Source Software, 6(60), 3021.
[10] Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3), 90–95.